Skip to content

[pull] main from triton-lang:main#1

Open
pull[bot] wants to merge 3453 commits into
maryamtahhan:mainfrom
triton-lang:main
Open

[pull] main from triton-lang:main#1
pull[bot] wants to merge 3453 commits into
maryamtahhan:mainfrom
triton-lang:main

Conversation

@pull
Copy link
Copy Markdown

@pull pull Bot commented Apr 30, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

@pull pull Bot added ⤵️ pull merge-conflict Resolve conflicts manually labels Apr 30, 2025
pawelszczerbuk and others added 28 commits April 16, 2026 07:29
TMEMCopy pattern is using stale tmem encoding. This may cause a crash in
the validator if the encodings mismatch.

---------

Co-authored-by: root <root@codex-gb200-0.brix.pawelszczerbuk.svc.cluster.local>
Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Adam P. Goucher <goucher@statslab.cam.ac.uk>
…9975)

fuseFuncArgVariables calls funcOp.getBody().front() on every LLVMFuncOp,
but external declarations (e.g @vprintf) have no body which causes the
pass to crash.

Reproducible on nvidia with LLVM_EXTRACT_DI_LOCAL_VARIABLES=1 on any
kernel that uses device_print.

Changes:
- Skip fuseFuncArgVariables for external declarations in
LLVMDILocalVariable
- Relax the isIntOrFloat() assert in both LLVMDIScope and
LLVMDILocalVariable to also accept VectorType, consistent with
convertType() which already handles it
- Add lit test
It doesn't make sense currently to have CLC in a multicta context
without multicast. As such, we hide this flag and we infer it
automatically.

The reason why it doesn't make sense it's because in multicta just one
CTA is allowed to request to cancel. In CUDA you can imagine patterns
like doing CLC without multicast and then share the result from one CTA
to all the others manually. We don't allow that in Gluon.
We also add tighter invariants for gather/scatter ops as well
Fix's API issues in the bench_mlp.py script.

`python/triton_kernels/bench/bench_mlp.py` no longer ran with current
Triton code.

Running:

```
  torchrun --nproc-per-node=1 python/triton_kernels/bench/bench_mlp.py
```
fails with 
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/triton-source/python/triton_kernels/bench/bench_mlp.py", line 230, in <module>
[rank0]:     roofline_mlp(batch_sizes, 5760, 5760, 128, 4, dtypes[0], dtypes[1], ep, name="mlp_moe")
[rank0]:   File "/workspace/triton-source/python/triton_kernels/bench/bench_mlp.py", line 194, in roofline_mlp
[rank0]:     csv_path = roofline.compute_roofline(dim1, dim2, n_expts_tot, n_expts_act, parse_dtype(x_dtype),
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/triton_kernels/roofline.py", line 73, in compute_roofline
[rank0]:     perf = inject_proxy_and_call(val, args, kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/triton_kernels/roofline.py", line 64, in inject_proxy_and_call
[rank0]:     return bench_fn(*args_list, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/workspace/triton-source/python/triton_kernels/bench/bench_mlp.py", line 100, in bench_mlp
[rank0]:     symm_mem_pool = SymmetricMemoryPool()
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: SymmetricMemoryPool.__init__() missing 1 required positional argument: 'mesh'
E0403 20:14:38.021000 2225 torch/distributed/elastic/multiprocessing/api.py:988] failed (exitcode: 1) local_rank: 0 (pid: 2258) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 367, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 1016, in main
    run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 1007, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 184, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 332, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
python/triton_kernels/bench/bench_mlp.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-04-03_20:14:38
  host      : ab3ee0d0c408
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2258)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
We follow TMA load / store closely.
The main beneficial optimizations are:

1. Separate loader for weight and weight scales, allowing asymmetric
pipelining. Potentially we can have a separate partition for scale
factor load as well, but I didn't experiment with it.
2. Epilogue optimizations to decrease the number of instructions.
Epilogue instruction issue (especially with 8 warps) staves the other
warps of instruction issuing. Optimizing it improves overall
performance. I repurposed the 2 idle leftover warps as a store partition
to decrease the critical path in the epilogue.

gpt-oss-120b shapes, performance vs. `triton_kernels.matmul` on
synthetic "realistic" logits.

```
GPT-OSS-120B MoE MM1 E=128 EP=4 ES=8 B=2880x5760
Peak: 5 PFLOPS, 8 TBPS
batch_size                                            example                                          reference
----------------------------------------------------------------------------------------------------------------
       128        29.12 TFLOPS (  0.6%)    5.27 TBPS ( 65.9%)        22.74 TFLOPS (  0.5%)    4.11 TBPS ( 51.4%)
       160        36.67 TFLOPS (  0.7%)    5.96 TBPS ( 74.5%)        27.66 TFLOPS (  0.6%)    4.49 TBPS ( 56.2%)
       192        39.57 TFLOPS (  0.8%)    6.00 TBPS ( 75.0%)        28.94 TFLOPS (  0.6%)    4.39 TBPS ( 54.9%)
       224        52.64 TFLOPS (  1.1%)    5.99 TBPS ( 74.9%)        39.21 TFLOPS (  0.8%)    4.46 TBPS ( 55.8%)
       256        55.48 TFLOPS (  1.1%)    6.46 TBPS ( 80.7%)        40.17 TFLOPS (  0.8%)    4.67 TBPS ( 58.4%)
       320        67.89 TFLOPS (  1.4%)    6.64 TBPS ( 83.0%)        41.19 TFLOPS (  0.8%)    4.03 TBPS ( 50.4%)
       384        73.96 TFLOPS (  1.5%)    6.43 TBPS ( 80.4%)        45.74 TFLOPS (  0.9%)    3.98 TBPS ( 49.7%)
       448        91.18 TFLOPS (  1.8%)    6.64 TBPS ( 83.0%)        53.82 TFLOPS (  1.1%)    3.92 TBPS ( 49.0%)
       512        78.16 TFLOPS (  1.6%)    5.36 TBPS ( 67.0%)        60.78 TFLOPS (  1.2%)    4.17 TBPS ( 52.1%)
       640        90.66 TFLOPS (  1.8%)    5.17 TBPS ( 64.6%)        58.41 TFLOPS (  1.2%)    3.33 TBPS ( 41.6%)
       768        93.65 TFLOPS (  1.9%)    4.98 TBPS ( 62.3%)        61.17 TFLOPS (  1.2%)    3.25 TBPS ( 40.7%)
       896       103.66 TFLOPS (  2.1%)    5.04 TBPS ( 63.1%)        66.78 TFLOPS (  1.3%)    3.25 TBPS ( 40.6%)
      1024       135.60 TFLOPS (  2.7%)    5.16 TBPS ( 64.5%)        86.31 TFLOPS (  1.7%)    3.28 TBPS ( 41.0%)
      1280       131.21 TFLOPS (  2.6%)    3.80 TBPS ( 47.5%)       119.41 TFLOPS (  2.4%)    3.46 TBPS ( 43.3%)
      1536       175.65 TFLOPS (  3.5%)    4.26 TBPS ( 53.3%)       145.39 TFLOPS (  2.9%)    3.53 TBPS ( 44.1%)
      1792       181.42 TFLOPS (  3.6%)    3.85 TBPS ( 48.1%)       148.73 TFLOPS (  3.0%)    3.15 TBPS ( 39.4%)
      2048       188.32 TFLOPS (  3.8%)    3.96 TBPS ( 49.5%)       165.47 TFLOPS (  3.3%)    3.48 TBPS ( 43.5%)
      2560       205.57 TFLOPS (  4.1%)    3.45 TBPS ( 43.2%)       177.60 TFLOPS (  3.6%)    2.98 TBPS ( 37.3%)
      3072       263.31 TFLOPS (  5.3%)    3.62 TBPS ( 45.2%)       253.05 TFLOPS (  5.1%)    3.48 TBPS ( 43.5%)
      3584       283.21 TFLOPS (  5.7%)    3.38 TBPS ( 42.2%)       290.26 TFLOPS (  5.8%)    3.46 TBPS ( 43.2%)
      4096       348.60 TFLOPS (  7.0%)    3.61 TBPS ( 45.2%)       329.95 TFLOPS (  6.6%)    3.42 TBPS ( 42.8%)
      5120       421.37 TFLOPS (  8.4%)    3.57 TBPS ( 44.6%)       385.63 TFLOPS (  7.7%)    3.27 TBPS ( 40.8%)
      6144       524.65 TFLOPS ( 10.5%)    3.69 TBPS ( 46.1%)       473.72 TFLOPS (  9.5%)    3.33 TBPS ( 41.6%)
      7168       626.20 TFLOPS ( 12.5%)    3.79 TBPS ( 47.4%)       582.67 TFLOPS ( 11.7%)    3.53 TBPS ( 44.1%)
      8192       675.19 TFLOPS ( 13.5%)    3.64 TBPS ( 45.6%)       629.79 TFLOPS ( 12.6%)    3.40 TBPS ( 42.5%)
      9216       729.35 TFLOPS ( 14.6%)    3.62 TBPS ( 45.2%)       674.98 TFLOPS ( 13.5%)    3.35 TBPS ( 41.9%)
     10240       749.61 TFLOPS ( 15.0%)    3.40 TBPS ( 42.4%)       685.04 TFLOPS ( 13.7%)    3.10 TBPS ( 38.8%)
     11264       806.46 TFLOPS ( 16.1%)    3.34 TBPS ( 41.8%)       721.29 TFLOPS ( 14.4%)    2.99 TBPS ( 37.3%)
     12288       895.78 TFLOPS ( 17.9%)    3.38 TBPS ( 42.3%)       805.38 TFLOPS ( 16.1%)    3.04 TBPS ( 38.0%)
     13312      1008.57 TFLOPS ( 20.2%)    3.53 TBPS ( 44.1%)       921.30 TFLOPS ( 18.4%)    3.22 TBPS ( 40.3%)
     14336      1001.71 TFLOPS ( 20.0%)    3.24 TBPS ( 40.5%)       915.14 TFLOPS ( 18.3%)    2.96 TBPS ( 37.0%)
     15360      1108.54 TFLOPS ( 22.2%)    3.35 TBPS ( 41.8%)      1020.08 TFLOPS ( 20.4%)    3.08 TBPS ( 38.5%)
     16384      1182.28 TFLOPS ( 23.6%)    3.41 TBPS ( 42.6%)      1083.60 TFLOPS ( 21.7%)    3.12 TBPS ( 39.0%)
     17408      1255.35 TFLOPS ( 25.1%)    3.44 TBPS ( 43.0%)      1106.65 TFLOPS ( 22.1%)    3.03 TBPS ( 37.9%)
     18432      1339.98 TFLOPS ( 26.8%)    3.50 TBPS ( 43.7%)      1188.63 TFLOPS ( 23.8%)    3.10 TBPS ( 38.8%)
     19456      1411.14 TFLOPS ( 28.2%)    3.49 TBPS ( 43.6%)      1239.33 TFLOPS ( 24.8%)    3.06 TBPS ( 38.3%)
     20480      1465.49 TFLOPS ( 29.3%)    3.46 TBPS ( 43.3%)      1329.22 TFLOPS ( 26.6%)    3.14 TBPS ( 39.2%)
     21504      1429.77 TFLOPS ( 28.6%)    3.23 TBPS ( 40.3%)      1228.47 TFLOPS ( 24.6%)    2.77 TBPS ( 34.7%)
     22528      1432.90 TFLOPS ( 28.7%)    3.08 TBPS ( 38.5%)      1283.04 TFLOPS ( 25.7%)    2.76 TBPS ( 34.5%)
     23552      1530.40 TFLOPS ( 30.6%)    3.17 TBPS ( 39.6%)      1304.38 TFLOPS ( 26.1%)    2.70 TBPS ( 33.8%)
     24576      1531.43 TFLOPS ( 30.6%)    3.05 TBPS ( 38.1%)      1378.61 TFLOPS ( 27.6%)    2.74 TBPS ( 34.3%)
     25600      1609.22 TFLOPS ( 32.2%)    3.09 TBPS ( 38.7%)      1403.54 TFLOPS ( 28.1%)    2.70 TBPS ( 33.7%)
     26624      1591.86 TFLOPS ( 31.8%)    2.96 TBPS ( 37.0%)      1461.78 TFLOPS ( 29.2%)    2.72 TBPS ( 34.0%)
     27648      1708.82 TFLOPS ( 34.2%)    3.08 TBPS ( 38.5%)      1554.22 TFLOPS ( 31.1%)    2.80 TBPS ( 35.0%)
     28672      1715.65 TFLOPS ( 34.3%)    3.00 TBPS ( 37.5%)      1582.07 TFLOPS ( 31.6%)    2.77 TBPS ( 34.6%)
     29696      1756.95 TFLOPS ( 35.1%)    2.98 TBPS ( 37.3%)      1562.01 TFLOPS ( 31.2%)    2.65 TBPS ( 33.1%)
     30720      1801.20 TFLOPS ( 36.0%)    2.97 TBPS ( 37.2%)      1681.00 TFLOPS ( 33.6%)    2.77 TBPS ( 34.7%)
     31744      1979.10 TFLOPS ( 39.6%)    3.17 TBPS ( 39.7%)      1754.99 TFLOPS ( 35.1%)    2.82 TBPS ( 35.2%)
```

And on uniform logits

```
batch_size                                            example                                          reference
----------------------------------------------------------------------------------------------------------------
       128        73.67 TFLOPS (  1.5%)    5.41 TBPS ( 67.7%)        60.70 TFLOPS (  1.2%)    4.46 TBPS ( 55.7%)
       160        96.46 TFLOPS (  1.9%)    5.50 TBPS ( 68.7%)        78.10 TFLOPS (  1.6%)    4.45 TBPS ( 55.6%)
       192       115.43 TFLOPS (  2.3%)    5.43 TBPS ( 67.9%)        94.71 TFLOPS (  1.9%)    4.45 TBPS ( 55.7%)
       224       141.31 TFLOPS (  2.8%)    5.47 TBPS ( 68.3%)       115.26 TFLOPS (  2.3%)    4.46 TBPS ( 55.7%)
       256       155.77 TFLOPS (  3.1%)    5.47 TBPS ( 68.4%)       126.31 TFLOPS (  2.5%)    4.44 TBPS ( 55.5%)
       320       205.17 TFLOPS (  4.1%)    5.46 TBPS ( 68.2%)       161.57 TFLOPS (  3.2%)    4.30 TBPS ( 53.7%)
       384       251.10 TFLOPS (  5.0%)    5.50 TBPS ( 68.8%)       195.10 TFLOPS (  3.9%)    4.28 TBPS ( 53.5%)
       448       295.50 TFLOPS (  5.9%)    5.53 TBPS ( 69.2%)       228.63 TFLOPS (  4.6%)    4.28 TBPS ( 53.5%)
       512       316.07 TFLOPS (  6.3%)    5.34 TBPS ( 66.7%)       254.07 TFLOPS (  5.1%)    4.29 TBPS ( 53.6%)
       640       414.90 TFLOPS (  8.3%)    5.43 TBPS ( 67.9%)       293.48 TFLOPS (  5.9%)    3.84 TBPS ( 48.0%)
       768       489.21 TFLOPS (  9.8%)    5.43 TBPS ( 67.9%)       350.19 TFLOPS (  7.0%)    3.89 TBPS ( 48.6%)
       896       553.62 TFLOPS ( 11.1%)    5.39 TBPS ( 67.4%)       405.50 TFLOPS (  8.1%)    3.95 TBPS ( 49.4%)
      1024       576.71 TFLOPS ( 11.5%)    4.92 TBPS ( 61.5%)       463.70 TFLOPS (  9.3%)    3.95 TBPS ( 49.4%)
      1280       682.95 TFLOPS ( 13.7%)    4.76 TBPS ( 59.5%)       571.87 TFLOPS ( 11.4%)    3.98 TBPS ( 49.8%)
      1536       837.25 TFLOPS ( 16.7%)    4.99 TBPS ( 62.4%)       679.20 TFLOPS ( 13.6%)    4.05 TBPS ( 50.6%)
      1792       934.09 TFLOPS ( 18.7%)    4.84 TBPS ( 60.5%)       686.46 TFLOPS ( 13.7%)    3.56 TBPS ( 44.5%)
      2048       937.04 TFLOPS ( 18.7%)    4.25 TBPS ( 53.2%)       714.85 TFLOPS ( 14.3%)    3.24 TBPS ( 40.6%)
      2560      1081.82 TFLOPS ( 21.6%)    4.07 TBPS ( 50.9%)      1020.85 TFLOPS ( 20.4%)    3.84 TBPS ( 48.0%)
      3072      1313.97 TFLOPS ( 26.3%)    4.21 TBPS ( 52.7%)      1209.26 TFLOPS ( 24.2%)    3.88 TBPS ( 48.5%)
      3584      1533.25 TFLOPS ( 30.7%)    4.27 TBPS ( 53.3%)      1410.67 TFLOPS ( 28.2%)    3.93 TBPS ( 49.1%)
      4096      1399.52 TFLOPS ( 28.0%)    3.42 TBPS ( 42.7%)      1245.96 TFLOPS ( 24.9%)    3.04 TBPS ( 38.1%)
      5120      1489.01 TFLOPS ( 29.8%)    2.97 TBPS ( 37.1%)      1320.69 TFLOPS ( 26.4%)    2.63 TBPS ( 32.9%)
      6144      1901.82 TFLOPS ( 38.0%)    3.21 TBPS ( 40.2%)      1638.02 TFLOPS ( 32.8%)    2.77 TBPS ( 34.6%)
      7168      2234.87 TFLOPS ( 44.7%)    3.28 TBPS ( 41.0%)      1987.25 TFLOPS ( 39.7%)    2.92 TBPS ( 36.5%)
      8192      1999.71 TFLOPS ( 40.0%)    2.59 TBPS ( 32.4%)      1747.04 TFLOPS ( 34.9%)    2.26 TBPS ( 28.3%)
      9216      2024.82 TFLOPS ( 40.5%)    2.38 TBPS ( 29.7%)      1802.45 TFLOPS ( 36.0%)    2.12 TBPS ( 26.5%)
     10240      2233.54 TFLOPS ( 44.7%)    2.38 TBPS ( 29.8%)      1987.52 TFLOPS ( 39.8%)    2.12 TBPS ( 26.5%)
     11264      2423.11 TFLOPS ( 48.5%)    2.40 TBPS ( 30.0%)      2163.00 TFLOPS ( 43.3%)    2.14 TBPS ( 26.8%)
     12288      2230.25 TFLOPS ( 44.6%)    2.05 TBPS ( 25.7%)      1971.60 TFLOPS ( 39.4%)    1.82 TBPS ( 22.7%)
     13312      2386.76 TFLOPS ( 47.7%)    2.07 TBPS ( 25.9%)      2089.59 TFLOPS ( 41.8%)    1.81 TBPS ( 22.7%)
     14336      2535.85 TFLOPS ( 50.7%)    2.08 TBPS ( 25.9%)      2214.30 TFLOPS ( 44.3%)    1.81 TBPS ( 22.7%)
     15360      2708.63 TFLOPS ( 54.2%)    2.10 TBPS ( 26.3%)      2371.09 TFLOPS ( 47.4%)    1.84 TBPS ( 23.0%)
     16384      2495.78 TFLOPS ( 49.9%)    1.85 TBPS ( 23.1%)      2256.94 TFLOPS ( 45.1%)    1.67 TBPS ( 20.9%)
     17408      2613.94 TFLOPS ( 52.3%)    1.85 TBPS ( 23.1%)      2350.97 TFLOPS ( 47.0%)    1.66 TBPS ( 20.8%)
     18432      2600.95 TFLOPS ( 52.0%)    1.76 TBPS ( 22.0%)      2376.24 TFLOPS ( 47.5%)    1.61 TBPS ( 20.1%)
     19456      2720.43 TFLOPS ( 54.4%)    1.77 TBPS ( 22.1%)      2445.94 TFLOPS ( 48.9%)    1.59 TBPS ( 19.9%)
     20480      2694.19 TFLOPS ( 53.9%)    1.68 TBPS ( 21.0%)      2418.54 TFLOPS ( 48.4%)    1.51 TBPS ( 18.9%)
     21504      2667.33 TFLOPS ( 53.3%)    1.61 TBPS ( 20.1%)      2432.76 TFLOPS ( 48.7%)    1.47 TBPS ( 18.4%)
     22528      2773.18 TFLOPS ( 55.5%)    1.62 TBPS ( 20.3%)      2531.76 TFLOPS ( 50.6%)    1.48 TBPS ( 18.5%)
     23552      2867.55 TFLOPS ( 57.4%)    1.62 TBPS ( 20.3%)      2604.20 TFLOPS ( 52.1%)    1.48 TBPS ( 18.4%)
     24576      2736.19 TFLOPS ( 54.7%)    1.50 TBPS ( 18.8%)      2499.52 TFLOPS ( 50.0%)    1.37 TBPS ( 17.2%)
     25600      2789.45 TFLOPS ( 55.8%)    1.49 TBPS ( 18.6%)      2554.81 TFLOPS ( 51.1%)    1.37 TBPS ( 17.1%)
     26624      2882.35 TFLOPS ( 57.6%)    1.50 TBPS ( 18.7%)      2669.93 TFLOPS ( 53.4%)    1.39 TBPS ( 17.4%)
     27648      2856.47 TFLOPS ( 57.1%)    1.45 TBPS ( 18.1%)      2624.29 TFLOPS ( 52.5%)    1.33 TBPS ( 16.7%)
     28672      2851.52 TFLOPS ( 57.0%)    1.41 TBPS ( 17.7%)      2621.20 TFLOPS ( 52.4%)    1.30 TBPS ( 16.2%)
     29696      2826.90 TFLOPS ( 56.5%)    1.37 TBPS ( 17.1%)      2483.67 TFLOPS ( 49.7%)    1.20 TBPS ( 15.1%)
     30720      2894.51 TFLOPS ( 57.9%)    1.37 TBPS ( 17.2%)      2527.41 TFLOPS ( 50.5%)    1.20 TBPS ( 15.0%)
     31744      2875.63 TFLOPS ( 57.5%)    1.34 TBPS ( 16.7%)      2505.68 TFLOPS ( 50.1%)    1.16 TBPS ( 14.5%)
```
[GSan] Model TMAReduceOp as atomic

Previously we were modelling it as a non-atomic store. This now models
it as atomic, but is still not 100% right as we're doing the shadow cell
update eagerly before the TMA operation has completed.
…ale with nvfp4 output (#9854)

nvfp4 has a tensor-wide scale. This PR adds support for this scale when
the matmul output needs to be quantized to nvfp4. We use flex ctx to
carry the scale.


<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

# New contributor declaration
- [ ] I am not making a trivial change, such as fixing a typo in a
comment.

- [ ] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)
* Convert `mask` to a tensor so it broadcasts correctly in gluon. NFC
for triton
* fix the `tl_obj_*` helpers
* add an extra hook for rewriting some argument annotations
We make things a bit more uniform by decomposing the xor mask and
emitting instructions accordingly. For a `mask` in [1, 15], on RDNA +
gfx1250 we use a single `row_xmask` DPP instruction, and on CDNA we use
1 or 2 DPP instructions. For a `mask >= 16`, on RDNA, we use a single
`v_permlanex16`, and on CDNA we use `ds_bpermute`.

We also pull some static utility functions into an anonymous namespace
and remove the ShflKind::down case from the enum as it was
unimplemented.
Improve f16 gemm gfx1250-gluon performance. Improves
gemm_tdm_pipelined_single_warp_per_simd_schedule_kernel by moving
tdm.load earlier; from the top of the loop (which hides 3/4th of a
loop-iteration's worth of cycles) to right after the wait (which hides a
full loop-iteration's worth of cycles).
This only fixes the mentioned kernel; other kernels need independent
benchmarking and improving.
CLC gets its own partition, running over threads 48-63.

We model CLC as we model TMA writes, via a Barrier::EffectWrites.
The idea of this mode is that we link all the writes on the op to the
barrier. We also annotate in the table `barrierWriteRecipients` which
CTAs will become visible once we wait on the associated barrier.

We note something interesting and document it.
`BarrierTrackingMode::Frontier` should be used when we have a
commit/arrive/expect op that affects anything in flight before it.
Instead, we use `BarrierTrackingMode::EffectWrites` when the PTX op
accepts a barrier so the barrier just signals the completion of the op's
particular write.

The other point we add is a flag `bool diagonalEffectRecipientCTAs`.
This differentiates the behaviour between TMA, where after waiting on
the barrier you see all the writes from all the CTAs in the multicas
group, vs. the diagonal version, as in CLC, where waiting on CTAi just
makes the thread see the CTAi memory.
Hello! The coalescing pass currently chooses `sizePerThread` for atomics
using the same pointer alignment / contiguity heuristic as regular
stores.

For some atomic lowerings this is too optimistic: the pass will select a
128-bit per-thread layout even when the backend lowers the atomic as a
narrower operation. That creates gaps in the warp-level write pattern
and can hurt throughput significantly.

On the attached gh200 atomic-add microbenchmark, this PR improves Triton
from:

- `int32`: 17.98 -> 72.86 Gupdates/s (~4.1x)
- `int64`: 19.29 -> 39.21 Gupdates/s (~2.0x)
- `fp64`: 13.42 -> 26.79 Gupdates/s (~2.0x)

These numbers match the best Gluon blocked layouts, which use
`size_per_thread = 1` for `int32` / `int64` / `fp64`.

The attached microbenchmark focuses on atomic add, and we have only
tested on Hopper, but we think it could be worth building a richer
lowering path for different compute capabilities and dtypes, as more
people realise the performance benefits of associative int32 atomics.

Thanks @leijurv for input on these commits.

<details>
<summary><b>Raw results</b> (click to expand)</summary>

```
int32 results:
    baseline triton atomic took     382.13 ms @ 17.98 Gupdates/s
    patched triton atomic took      94.32 ms  @ 72.86 Gupdates/s
    Gluon [8] atomic took           706.87 ms @  9.72 Gupdates/s
    Gluon [4] atomic took           381.84 ms @ 18.00 Gupdates/s
    Gluon [2] atomic took           197.41 ms @ 34.81 Gupdates/s
    Gluon [1] atomic took           94.88 ms  @ 72.43 Gupdates/s

int64 results:
    baseline triton atomic took     356.30 ms @ 19.29 Gupdates/s
    patched triton atomic took      175.25 ms @ 39.21 Gupdates/s
    Gluon [8] atomic took           773.59 ms @  8.88 Gupdates/s
    Gluon [4] atomic took           684.48 ms @ 10.04 Gupdates/s
    Gluon [2] atomic took           356.34 ms @ 19.28 Gupdates/s
    Gluon [1] atomic took           174.56 ms @ 39.37 Gupdates/s

fp64 results:
    baseline triton atomic took     512.20 ms @ 13.42 Gupdates/s
    patched triton atomic took      256.55 ms @ 26.79 Gupdates/s
    Gluon [8] atomic took           773.32 ms @  8.89 Gupdates/s
    Gluon [4] atomic took           683.84 ms @ 10.05 Gupdates/s
    Gluon [2] atomic took           512.01 ms @ 13.42 Gupdates/s
    Gluon [1] atomic took           253.05 ms @ 27.16 Gupdates/s
```

</details>

<details>
<summary><b>Microbenchmark code</b> (click to expand)</summary>

```py
import torch
import triton
import triton.language as tl
import triton.experimental.gluon as gluon
import triton.experimental.gluon.language as gl

@triton.jit
def triton_atomic_add_kernel(input, output, ELEMENTS_PER_PID: tl.constexpr, INNER_REPEATS: tl.constexpr, SEM: tl.constexpr = "relaxed"):
    program_offset = tl.program_id(axis=0) * ELEMENTS_PER_PID
    load_store_indices = program_offset + tl.arange(0, ELEMENTS_PER_PID)
    
    load_indices = input + load_store_indices
    input = tl.load(load_indices)
    
    store_indices = output + load_store_indices
    for _ in tl.static_range(INNER_REPEATS):
        tl.atomic_add(store_indices, input, sem=SEM)
        
@gluon.jit
def gluon_atomic_add_kernel(input, output, ELEMENTS_PER_PID: gl.constexpr, INNER_REPEATS: gl.constexpr, SIZE_PER_THREAD: gl.constexpr, SEM: gl.constexpr = "relaxed"):
    LAYOUT: gl.constexpr = gl.BlockedLayout(size_per_thread=[SIZE_PER_THREAD], threads_per_warp=[32], warps_per_cta=[4], order=[0])
    program_offset = gl.program_id(axis=0) * ELEMENTS_PER_PID
    load_store_indices = program_offset + gl.arange(0, ELEMENTS_PER_PID, layout=LAYOUT)
    
    load_indices = input + load_store_indices
    input = gl.load(load_indices)
    
    store_indices = output + load_store_indices
    for _ in gl.static_range(INNER_REPEATS):
        gl.atomic_add(store_indices, input, sem=SEM)
        
def make_input(total_elements, dtype):
    if dtype.is_floating_point:
        return torch.rand((total_elements,), dtype=dtype)
    else:
        return torch.randint(-1000, 1000, (total_elements,), dtype=dtype)
    
if __name__ == "__main__":
    torch.set_default_device('cuda')
    total_elements = 2**30
    ELEMENTS_PER_PID = 2**9
    grid = total_elements // ELEMENTS_PER_PID
    INNER_REPEATS = 2**6

    for dtype in [torch.int32, torch.int64, torch.float32, torch.float64,]:
        print(f'\tDTYPE: {dtype}')
        input = make_input(total_elements, dtype)
        output_triton = torch.zeros_like(input)
        
        def triton_kernel_launcher():
            triton_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, num_warps=4)
            
        ms_triton = triton.testing.do_bench(triton_kernel_launcher)
        triton_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_triton / 10e6
        print(f'Triton atomic took \t{ms_triton:.2f} ms @ {triton_gupdates_per_sec:.2f} Gupdates/s')
            
        for SIZE_PER_THREAD in [8, 4, 2, 1]:
            def gluon_kernel_launcher():
                gluon_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, SIZE_PER_THREAD, num_warps=4)
            ms_gluon = triton.testing.do_bench(gluon_kernel_launcher)
            gluon_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_gluon / 10e6
            print(f'Gluon [{SIZE_PER_THREAD}] atomic took \t{ms_gluon:.2f} ms @ {gluon_gupdates_per_sec:.2f} Gupdates/s')
```

</details>


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
- [x] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)
The test has been failing increasingly freqently.
Disable while investigating.
…10066)

The subslice safety check in lowerLdStMatrix uses a bitmask to verify
that affine offsets don't touch the contiguous part of the tile's offset
dimension. It was using getOutDimSizeLog2 (which returns log2 of the
size) instead of getOutDimSize (the actual size) to construct this mask.

For outDimSize=8: log2(8)-1 = 2 (0b010) only checks bit 1, whereas 8-1 =
7 (0b111) correctly checks all bits within the tile span.

The bug makes the check too permissive — it could allow subslices that
overlap the contiguous tile region. Latent because the specific bit
patterns in maskSpanAffineOffset rarely trigger the difference.

Fix both the NVIDIA (Utility.cpp) and AMD (MemoryOpToLLVM.cpp) backends.

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

  - [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- [x] This PR does not need a test because existing tests pass, and the
fix only makes the safety check stricter .

  - [x] I have not added any `lit` tests.
…ons. (#10060)

The fp64 MMA path now operates at native `m8n8k4` granularity,
supporting any shape that is a multiple of 8×8×4, including the minimal
8×8×4 case.

This is an extension of #7310
(The implementation was based on that PR)

Tests passed on A100.

## Files Changed

### `lib/Dialect/TritonGPU/IR/Dialect.cpp`
- `getRepForOperand`: Changed `tileBitWidthK` from `2 * 256` to `1 *
256` for fp64 (K-tile = 4). Changed `tileSize[M]` from hardcoded `16` to
`8` for fp64.

### `lib/Dialect/TritonGPU/Transforms/Utility.cpp`
- `mmaVersionToInstrShape`: Returns `instrShape[M] = 8` for fp64 (was
always 16).

### `lib/Dialect/TritonGPU/IR/LinearLayoutConversions.cpp`
- `nvidiaDotToLinearLayout`: Uses `instrShape` from the MMA encoding for
tile shape computation. K tile multiplier is 4 (not 8) when `instrM ==
8`.

### `third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/DotOpToLLVM/MMAv2.cpp`
- `getMmaRetType`: fp64 returns `struct{f64, f64}` (2 elements) instead
of `struct{f64, f64, f64, f64}` (4 elements).
- `callMmaAmpereFp64`: Extended: Now it is able to also emit a single
`m8n8k4` instruction per call (single retArgs(2), aArgs(1), bArgs(1),
cArgs(2)).
- `numRegisters`: `{1, 1, 1}` for fp64 (was effectively `{2, 1, 2}`).
- `numMmaRets`: 2 for fp64 (was 4).
- `numCPackedElem`: 1 for fp64 (was incorrectly computed).
- fc indexing formula: Uses `numMmaRets * numCPackedElem` instead of
hardcoded `4`.

### `third_party/nvidia/backend/compiler.py`
- `min_dot_size`: Added `elif lhs_bitwidth == 64: return (1, 1, 4)` to
allow K=4 for fp64.

### `python/test/unit/language/test_core.py`
- Added small fp64 test cases: `(8,8,4)`, `(8,8,8)`, `(16,8,4)`,
`(8,8,16)` with `num_warps=1`.

### `test/Conversion/tritongpu_to_llvm.mlir`
- Updated `f64_mma_cvt` test to use `instrShape = [8, 8]` matching the
new fp64 encoding.

----


# New contributor declaration
- [X] I am not making a trivial change, such as fixing a typo in a
comment.

- [X] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [X] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [X] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [X] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)
Partially reverts:
- #9730

it still keeps/adjusts the improved verifiers to error out if the kernel
provides invalid strides. This brings it in line with TMA which also
only supports the logical HW view and transposes the view in the kernel
instead.
This renames both the ampere style async_copy and tma functions. The
original names are pretty much copied directly from ptx and while they
are very descriptive, are also rather verbose.

IMO these shorter names have a better glance-value and better conveys
the intent of the programmer.

I've also kept the original names as aliases to preserve
backward-compatibility.
#10085)

Reverts #10054 given now
the mirror out of sync isue is gone.
This PR:
- Enables CS and CV load cache modifier in Nvidia backend
- Enables CS cache modifier in frontend
- Improves test coverage for CDNA3 and CDNA4

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
…10062)

Extends the AMD ds_read_tr* local-load lowering to accept
PartitionedSharedEncodingAttr as the source encoding.
Previously the pattern bailed out as soon as it saw a partitioned shared
encoding, forcing a slower generic
local-load lowering for all WMMA dot-operand loads from partitioned LDS
buffers.

---------

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>
Fpsan insufficiently synchronized global scratch accesses - we were
missing a barrier in between memory accesses, which led to random
failures in the tests and racy behavior.
adityasingh2400 and others added 30 commits May 26, 2026 14:13
…#10356)

The current docstrings for `tl.multiple_of`, `tl.max_contiguous`, and
`tl.max_constancy` do not match the underlying semantics modeled by
`AxisInfo`. `multiple_of` describes its argument as "multiples of value"
without explaining how the hint composes with contiguity, and the other
two use the phrase "the value first values in input are
contiguous/constant", which conflates the hint with its per-dimension
length argument and confused at least one reader (#1324).

This PR rewrites each docstring to mirror the precise definitions in
`include/triton/Analysis/AxisInfo.h`. `multiple_of` now states that
`values[d]` is the largest power of two that divides the first element
of every contiguous group along dimension `d`. `max_contiguous` and
`max_constancy` now say that the input forms contiguous (resp. constant)
groups of length `values[d]` along dimension `d`, that `values` must
have one entry per dimension, and that each entry must be a power of
two.

Each docstring includes a concrete 1D example taken from `AxisInfo.h` so
the meaning is obvious from the rendered Sphinx page. `multiple_of` and
`max_contiguous` now cross-reference each other via `:func:` so users
can see how they compose to enable vectorized memory accesses. Behavior
is unchanged; this is documentation only.

Closes #1324.
they can still separately be overriden in the opt flags constraints
`K`-ragged metadata is not row-segment metadata for
`BlackwellActMXScaleLayout`; passing it corrupts matmul output with
swizzled activation scales.

Test: GB200 `test_k_ragged_mxfp8_act_scale_swizzling` fails on `main`
(77.3% mismatched elements) and passes here.
<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

This solves issue #10341

When specifying the `acc` argument of `tl.dot` with dtype `tl.float16`
without `out_dtype=tl.float16` raises an `AssertionError` because the
default for `out_dtype` is hard-coded to be `tl.float32`.
This PR changes the default `out_dtype` to be `None` to allow checking
whether a value has been provided.
This again allows to use the dtype of `acc` as the default for
`out_dtype` (if the `acc` argument is specified).

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)
…10386)

More accurately account the smem usage of the persistent tf32 matmul
kernel. This makes it so that we correctly enable 4-stage for tf32
persistent matmul when B is provided in the correct layout.

For M=N=K=4096 matmul this leads to an improvement of about ~13% on
GB200, from 560 TFLOP/s to 630 TFLOP/s.

Validated that both NNN and NNT matmuls still run. Note that NNN does
_not_ run with 4 stages, so some level of accounting / capping is
necessary.

<!---
The core Triton is a small number of people, and we receive many PRs
(thank
you!).  To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the
following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]`
with
`[x]` to indicate you have done them.
-->

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
- [x] This PR does not need a test because `it is a straightforward
heuristics change`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)
…put contiguity (#10359)

Fixes #10067.

In `SelectOpAxisInfoVisitor`'s tensor-cond branch, the call to
`getDivisibilityFromContiguity` only sees the lhs/rhs contiguities and
can overestimate divisibility when `condConstancy` further reduces the
output contiguity below either input's contiguity.

### Concrete example

- lhs: `[8, 9, 10, 11, 12, 13, 14, 15]` (c=8, d=8)
- rhs: `[16, 17, 18, 19, 20, 21, 22, 23]` (c=8, d=16)
- Element-wise condition, so `condConstancy = 1`

Output contiguity collapses to `gcd(8, 8, 1) = 1` (every position is a
leader). But `getDivisibilityFromContiguity` sees `c_lhs == c_rhs == 8`
and returns `gcd(8, 16) = 8` — without accounting for `condConstancy`.
The output value at position 1 may be 17, which is not divisible by 8.

### Why this didn't blow up

On the current pow2 lattice, `gcd == min` on powers of 2. Codegen's
`vec_width = min(c, d/e, ...)` is bounded by contiguity, and contiguity
is computed correctly. So the overestimated divisibility is never the
binding constraint on `vec_width`. This is a latent soundness regression
introduced by #7781 — the pre-#7781 code clamped `divisibility` against
the just-computed output contiguity, and that was sound independent of
the pow2 invariant.

### Fix

A conditional GCD with the output contiguity at the SelectOp callsite.
The clamp fires only when `condConstancy` (or another shrinking factor)
reduces the output contiguity strictly below at least one input's
contiguity — i.e. it preserves the existing semantics when
`condConstancy` is non-binding.

The helper `getDivisibilityFromContiguity` is left unchanged (its other
callers — `MaxMinOpAxisInfoVisitor` and `AxisInfo::join` — don't have a
`condConstancy`-equivalent, so the same gap doesn't exist there).

### Test

Added `select_cond_constancy_clamps_divisibility` in
`test/Analysis/test-alignment.mlir`. The test fails before the fix
(`divisibility = [8]`) and passes after (`divisibility = [1]`).

# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [ ] I have not added any `lit` tests.
- [x] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
#10390)

`InThreadTranspose` rewrites `tt.load -> ttg.local_alloc ->
ttg.local_load -> dot_op` so the K-contiguous WMMA/MFMA operand can be
read from LDS as wide `ds_load_b128` instead of scalar `ds_load_u16`
pairs when the load order doesn't match the consumer's K dimension.
The pattern matcher in `matchInThreadTransposePattern` already accepts
`AMDWmmaEncodingAttr` alongside `AMDMfmaEncodingAttr`, but the gate in
`is_in_thread_transpose_enabled` only activates the pass on gfx942
(CDNA3) and gfx120x (RDNA4, enabled in #10185). Extend it to also
cover RDNA3 (gfx110x/gfx1103) and RDNA3.5 (gfx115x).

Added a `inThreadTranspose_wmma` sub-test to
`test/TritonGPU/amd/in-thread-transpose.mlir` (gfx1151, wave32, WMMA
encoding) that verifies the pass produces an `amdg.in_thread_transpose`
op and that the downstream `ttg.local_load` returns the K-contiguous
`dot_op` layout (`kWidth = 16`).

On AITER's `flash_attn_2.varlen_fwd` at the Qwen3-Omni ViT prefill
shape (B=1, S=3200, H=16, head_dim=72, fp16) on gfx1151, this lifts
the inner-loop V `local_load` from 512 scalar `ds_load_u16(_d16_hi)`
to 144 vectorized `ds_load_b128` and gives a 3.8% median speedup
(3.042 -> 2.925 ms).
Motivation for this PR is to enable kernel authors to feed/select
different llvm fn attributes locally at a kernel level scope as
optimizations/llvm fn attributes could differ between kernels. This PR
has 3 main changes:


1. Extend the LLVM function binding so add_fn_attr can emit valueless
attributes. LLVM distinguishes bare function attributes from
string-valued attributes, and AMD kernel attrs may need either form.
2. Add the HIP llvm_fn_attrs option so kernel authors can attach
comma-separated LLVM function attributes to their own kernel during
compilation without changing global backend state.
3. Add AMD tests covering single attributes, multiple attributes
including a bare attr, and a sched-strategy attr that changes generated
AMDGCN.

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
…layout (#10392)

It is to the fix problem arises when `layoutToGluon()` converting a
`PaddedSharedEncodingAttr` (w/ multicta) into a PaddedSharedLayout
object.

### Problem
Consider the case where PaddedSharedEncodingAttr has shape=[X, 128] and
CGA-layout=[0,1],
* Before the fix: `PaddedSharedLayout.cga_layout` is built directly from
`PaddedSharedEncodingAttr`'s "block" basis. It will be `[0, 64=128/2]`
  * After the fix: `PaddedSharedLayout.cga_layout` will be `[0, 1]`

The rationale can also be extrapolated from the
`PaddedSharedLayout::_to_ir()` where the linear-layout is obtained by
`operator*(PaddedSharedLayout.offset_bases,
PaddedSharedLayout.cga_layout)` via function
`get_padded_shared_layout()`

### Test
* command `python3 -m pytest -v
python/test/gluon/test_frontend.py::test_infer_layout_for_padded_shared`
* the `ttgl.static_assert()` in the kernel makes sure CGA layout is
expected
Currently, running `python
python/examples/gluon/01-attention-forward.py` fails with

```
python/examples/gluon/01-attention-forward.py:386:14: error: source and result must have the same logical storage size (4096 vs 2048)
    p_tmem = s_tmem._reinterpret(config.dtype, [config.SPLIT_M, 2 * config.BLOCK_N], config.p_tmem_layout)
```

This happens when the benchmark is run with fp8. `pytest
python/examples/gluon/01-attention-forward.py` doesn't fail simply
because it doesn't test on fp8.

The regression is from #10243
but the problematic code hardcoding a fixed column count like
`config.BLOCK_N // 2` has been there for a long time. Previously it only
worked due to the loose convention on reinterpret.
## Summary
Improve error diagnostics for user-facing `assert` statements in
`semantic.py` and `core.py` by adding descriptive messages following
the "expected X, got Y" pattern.

As requested by @lezcano in #10341 — previously, users hitting these
asserts (e.g., mismatched accumulator dtype in `tl.dot`) would get an
empty `AssertionError` with no guidance on what went wrong.

## Changes
- `semantic.py`: 17 bare asserts → descriptive messages
- `core.py`: 5 bare asserts → descriptive messages
- Split compound `assert shape == X and dtype == Y` into two separate
  asserts for clearer diagnostics
This reverts commit 36394d4. We saw
some compile failure cases with disabled vector combine. But should be
resolved and should re-land this commit once with
llvm/llvm-project#193499 or commit
`81d618b6bc1e71cda79fe7bf9cbab63933dd5975` gets picked up in the next
LLVM bump.
This PR adds check that tensor descriptor and smem encodings are equal.

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
We expect two different types of failures in this 2CTA race:
- The 1 cta one
- Have CTA1 race ahead (as it's doing nothing on tcgen05 ops) then
  load/store smem, and invalidate the barrier, before the barrier has
  finished being used by tcgen05_commit.

We also remove the TWO_CTA flag in the TMA barrier as it doesn't make
sense here.
Update the separate Blackwell ptxas pin from CUDA 13.1.80 to CUDA
13.3.33 so Blackwell builds use the CUDA 13.3 assembler artifact.
Use `string_view` and `packFixStrLiteral`
Introduce a build number to the LLVM build to avoid accidentally
changing the LLVM that is used when building Triton. This allows us to
have multiple builds from the same LLVM hash, but ensures that a given
Triton commit builds deterministically because it is tied to the build
number.

Use the sha sum as well to enforce this.
Make `triton_kernels.tensor.convert_layout` return its input when its
storage already represents the requested layout.

## Motivation

Callers should not need to guard an idempotent layout conversion. This
avoids redundant unswizzle/swizzle copies for already-correct MXFP
intermediates.

### Strided layouts

`StridedLayout` identifies the packed dimension, not compact physical
strides. For example:

```python
pitched = torch.empty_strided((64, 128), (256, 1))
tensor = wrap_torch_tensor(pitched)
assert tensor.storage.layout == StridedLayout(-1)

converted = convert_layout(tensor, StridedLayout(-1))
```

Before this PR, the matching-layout conversion still copied through
canonical storage and incidentally densified the tensor:

```python
assert converted is not tensor
assert converted.storage.data.stride() == (128, 1)
```

After this PR, the existing valid storage is preserved:

```python
assert converted is tensor
assert converted.storage.data.stride() == (256, 1)
```

Both strides are valid `StridedLayout(-1)` storage. Kernels and TMA
checks consume physical strides separately. Invalid strided storage with
no contiguous dimension is still rejected.

### Swizzled layouts

Matching swizzled layouts are also already in the requested storage
encoding. For example:

```python
swizzled = convert_layout(tensor, HopperMXScaleLayout(-2, 8))
assert convert_layout(swizzled, HopperMXScaleLayout(-2, 8)) is swizzled
```

This avoids an unnecessary unswizzle/swizzle round trip and preserves
any layout-specific padded storage. A different parameterized layout
still converts normally:

```python
assert convert_layout(swizzled, HopperMXScaleLayout(-2, 4)) is not swizzled
```

`convert_layout` changes storage encoding. It does not clone, densify,
or scrub padding when the existing storage is already valid for the
requested layout.
Now that the AMDGPU backend supports Memory Model Relaxation
hannotations (MMRAs) on fences to indicate that only LDS / only global
memory should be waited on, there is no need to manually insert waitcnt
instructions, as the backend will correctly insert only the requested
waits. This makes the atomic behavior we're looking for transparent to
LLVM IR and avoids low-level intrinsics that require a bunch of complex
bit-packing logic the backend already knows how to do.

There's been a soft request from the backend folks for people to migrate
to this form of fence if possible (though we at least have the benefit
that we're not using inline assembly).

AI disclosure: I wrote up what needed doing and had Codex do the work
here (though I have checked the logic and it looks correct).
Fast tree traversal and remove unused methods
# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

## Summary

Make local and remote cache group reads reject incomplete groups instead
of returning a partial mapping. A partial group can make compilation
treat a corrupted cache entry as a cache hit, only to fail later when
loading the missing binary or IR artifact.

## What changed

- Return `None` from `FileCacheManager.get_group()` if any recorded
child path is missing.
- Return `None` from `RemoteCacheManager.get_group()` if the remote
backend does not return all recorded children.
- Add regression coverage for both local and remote cache managers.

## Testing

- `pre-commit run --from-ref origin/main --to-ref HEAD`
- Added Python unit coverage in `python/test/unit/runtime/test_cache.py`

Co-authored-by: Kevin-Li-2025 <2242139@qq.com>
Skip empty leaf nodes. We don't skip empty internal nodes to make the
code simpler and more efficient. In fact, most empty nodes are leaves
(kernels)
This updates Triton's Blackwell-specific packaged CUPTI toolchain entry
to the published CUDA 13.3.35 artifact. This fix a memory leak in cupti
…10415)

Currently desc.store(...) does not guaruntee that that write is
completed to global memory, so this makes message passing impossible.

e.g.
```
desc.store(...)
tl.atomic_xchg(flag, 1, sem="release")
```

does not release the store.

To maintain perf in gluon, I also expose the read-only tma wait variant
so users can explicitly opt-in to the behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

⤵️ pull merge-conflict Resolve conflicts manually

Projects

None yet

Development

Successfully merging this pull request may close these issues.